fix(mcp): use tiktoken for response-size-guard token estimation #39912
Conversation
The MCP response-size-guard middleware estimates token counts to
decide when to truncate or block oversized tool responses. The
existing estimator used a simple char-to-token heuristic
(CHARS_PER_TOKEN = 3.5) that miscounts JSON-heavy MCP responses
relative to Claude's actual tokenizer:
For an 80KB get_dataset_info response, estimate_token_count gives:
- 3.5 chars/token heuristic: ~22,800 tokens
- tiktoken cl100k_base: varies by content, but far closer to the truth
This let payloads slip past the 25k token limit while still being
truncated by Claude Agent SDK's own threshold — the SDK saved them
into a file the model couldn't read back, causing 120s timeouts in
the eval suite.
Switch the estimator to tiktoken's cl100k_base encoding (a real BPE
tokenizer with a vocabulary similar to Claude's; tracks Claude's
counts within roughly ±10% for English and JSON-heavy content). The
char-based heuristic stays as a fallback for environments where
tiktoken is not installed; its ratio drops from 3.5 to 3.0 chars per
token to be more conservative for JSON content.
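A minimal sketch of the estimator this describes, assuming the names estimate_token_count, CHARS_PER_TOKEN, and _ENCODING from this PR (the exact code in superset/mcp_service/utils/token_utils.py may differ):

```python
# Sketch only — mirrors the behavior described in this PR, not necessarily
# the exact implementation in superset/mcp_service/utils/token_utils.py.
try:
    import tiktoken

    _ENCODING = tiktoken.get_encoding("cl100k_base")
except ImportError:
    _ENCODING = None  # fallback path: tiktoken not installed

CHARS_PER_TOKEN = 3.0  # conservative ratio for JSON-heavy content (was 3.5)


def estimate_token_count(text: str) -> int:
    if _ENCODING is not None:
        try:
            return len(_ENCODING.encode(text))
        except Exception:
            # Defensive fallback: the size guard must never fail-open.
            pass
    return int(len(text) / CHARS_PER_TOKEN)
```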
tiktoken is added as a dependency of the existing ``fastmcp`` extra,
so anyone installing ``apache-superset[fastmcp]`` gets it
automatically. requirements/base.txt and requirements/development.txt
are regenerated via scripts/uv-pip-compile.sh.
Tests:
- New unit tests cover the tiktoken-loaded path, the unavailable-
fallback path, and a defensive fallback when tiktoken's encode
raises (the size guard must never fail-open).
- One existing middleware test that depended on the old 3.5
chars/token heuristic is now decoupled by mocking
estimate_response_tokens directly.
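The decoupled test pattern looks roughly like this (hypothetical test body; the real assertions live in the middleware test module):

```python
# Hypothetical sketch: patch the estimator where the middleware uses it,
# instead of depending on any particular chars-per-token ratio.
from unittest import mock


def test_guard_blocks_oversized_response():
    with mock.patch(
        "superset.mcp_service.middleware.estimate_response_tokens",
        return_value=30_000,  # force an estimate above the 25k limit
    ):
        ...  # invoke the middleware and assert it truncates or blocks
```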
Codecov Report
```diff
@@            Coverage Diff             @@
##           master   #39912      +/-   ##
==========================================
- Coverage   63.88%   63.88%   -0.01%
==========================================
  Files        2583     2583
  Lines      136602   136617      +15
  Branches    31501    31502       +1
==========================================
  Hits        87274    87274
- Misses      47812    47827      +15
  Partials     1516     1516
```
Address review feedback on apache#39912: the lazy imports of estimate_response_tokens, format_size_limit_error, INFO_TOOLS, and truncate_oversized_response inside ResponseSizeGuardMiddleware methods made the conventional patch path (superset.mcp_service.middleware.estimate_response_tokens) raise AttributeError because those names didn't exist on the middleware module. Moving the imports to module level makes patching at "where it's used" work as expected, which is the standard mock convention. Tests now patch superset.mcp_service.middleware.estimate_response_tokens directly rather than the upstream definition module.
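A self-contained illustration of that pitfall, using throwaway in-memory modules as stand-ins for the real ones (mock.patch refuses to patch an attribute that does not exist on the target module):

```python
# Demo of the pitfall: "utils" stands in for token_utils, and
# "mw_lazy"/"mw" stand in for the middleware module.
import sys
import types
from unittest import mock

utils = types.ModuleType("utils")
utils.estimate_response_tokens = lambda text: len(text) // 3
sys.modules["utils"] = utils

# Lazy-import style: the middleware module never binds the name itself.
mw_lazy = types.ModuleType("mw_lazy")
exec(
    "def guard(text):\n"
    "    from utils import estimate_response_tokens\n"
    "    return estimate_response_tokens(text)\n",
    mw_lazy.__dict__,
)
sys.modules["mw_lazy"] = mw_lazy

try:
    with mock.patch("mw_lazy.estimate_response_tokens", return_value=0):
        pass
except AttributeError as exc:
    print(f"lazy import: {exc}")  # module has no such attribute

# Module-level import: the name exists on the module, so patching works.
mw = types.ModuleType("mw")
exec(
    "from utils import estimate_response_tokens\n"
    "def guard(text):\n"
    "    return estimate_response_tokens(text)\n",
    mw.__dict__,
)
sys.modules["mw"] = mw

with mock.patch("mw.estimate_response_tokens", return_value=42):
    print(mw.guard("anything"))  # -> 42
```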
Addressed in f20e95b. Root cause: the middleware was lazy-importing estimate_response_tokens, format_size_limit_error, INFO_TOOLS, and truncate_oversized_response inside its methods. Fix: hoisted all four imports to module level, so patching superset.mcp_service.middleware.estimate_response_tokens now works as expected. Hoisting the other three names wasn't strictly needed but is the right call — it keeps tests robust against future churn and gets rid of three pointless local imports.
SUMMARY
The MCP response-size-guard middleware (ResponseSizeGuardMiddleware) estimates token counts to decide when to truncate or block oversized tool responses. The existing estimator at superset/mcp_service/utils/token_utils.py used a simple char-to-token heuristic (CHARS_PER_TOKEN = 3.5) that miscounts JSON-heavy MCP responses relative to Claude's actual tokenizer. Specific responses could slip past the configured token limit while still being truncated by the Claude Agent SDK's own threshold — the SDK then saved them into a file the model could not read back, causing 120s timeouts in tool calls like get_dataset_info for wide datasets.

This PR switches the estimator to tiktoken's cl100k_base encoding — a real BPE tokenizer with a vocabulary similar to Claude's. For English and JSON-heavy content it tracks Claude's counts within roughly ±10%, which is far closer than any character-ratio heuristic.

The previous heuristic stays as a graceful fallback for environments where tiktoken is not installed; its ratio drops from 3.5 → 3.0 chars/token to be more conservative for JSON content (which it under-counted before).
BEFORE/AFTER
TESTING INSTRUCTIONS
New unit tests cover:
- the tiktoken-loaded path
- the fallback path when _ENCODING is None (tiktoken not installed), which uses len / CHARS_PER_TOKEN
- a defensive fallback when encode raises — the size guard must never fail-open

ADDITIONAL INFORMATION
New dependency: tiktoken>=0.7.0,<1.0 added to the fastmcp extra in pyproject.toml. Anyone installing apache-superset[fastmcp] gets it automatically. requirements/base.txt and requirements/development.txt regenerated via scripts/uv-pip-compile.sh.

No network calls: tiktoken is pure offline tokenization. Anthropic's count_tokens API is more accurate but adds a network roundtrip per tool result, which is too expensive for synchronous middleware.

Behavioral change: previously-passing token estimates for the same content will now report different (more accurate) numbers. Sites relying on a specific cap will see different effective behavior — typically slightly more conservative truncation for English-text-heavy responses, slightly less for highly repetitive content (BPE compresses repetition).
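For a rough sense of how the two estimators diverge on this kind of payload, a quick illustrative check (the payload shape is synthetic, not from the repo; requires tiktoken):

```python
# Compare tiktoken's count against the 3.0 chars/token fallback on a
# synthetic JSON payload resembling a wide get_dataset_info response.
import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
payload = json.dumps(
    {"columns": [{"name": f"col_{i}", "type": "VARCHAR"} for i in range(500)]}
)

print("chars:            ", len(payload))
print("tiktoken tokens:  ", len(enc.encode(payload)))
print("heuristic tokens: ", int(len(payload) / 3.0))
```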
Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
Introduces new feature or API
Removes existing feature or API